Medical charges vary widely by region, type of work, and other factors. This project aims to flag providers whose charges are unusually high as potential fraud. To do so, we need to decide what a fair comparison is, by considering the type of medical work (DRG) and the state.
First, we construct features/variables to identify fraud. Then, we reduce them with PCA and use k-means clustering to identify outliers.
Dataset description: https://data.cms.gov/Medicare-Inpatient/National-Summary-of-Inpatient-Charge-Data-by-Medic/efwk-h4x3
import numpy as np
import pandas as pd
df = pd.read_csv('inpatientCharges.csv')
Variable description
DRG Definition: Classification system that groups similar clinical conditions (diagnoses) and the procedures furnished by the hospital during the stay.
Total discharges: The number of discharges billed by the provider for inpatient hospital services. When you leave a hospital after treatment, you go through a process called hospital discharge.
Average Covered Charges = Total Covered Charge Amount / Total Discharges, Total Covered Charge Amount = the sum of all covered charges, Covered Charges: Charges for covered services that your health plan paid for.
Average Total Payments = Total Payments / Total Discharges, Payment is the amount a hospital actually receives for providing patient care. This is the actual amount paid to a hospital by consumers, insurers or governments.
Average Medicare Payments = Medicare Payment Amount / Total Discharges, Medicare Payment Amount: The average amount that Medicare pays to the provider for Medicare's share of the MS-DRG.
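Because each published figure is an average per discharge, a provider's total amount for a DRG can be recovered by multiplying back by the discharge count. The sketch below uses made-up numbers; the two rows and the helper columns `total_covered` and `payment_to_charge` are illustrative assumptions, not values or columns from the dataset.

```python
import pandas as pd

# Toy rows with made-up numbers; the first three column names follow the dataset.
toy = pd.DataFrame({
    "Total Discharges": [10, 40],
    "Average Covered Charges": [30000.0, 20000.0],
    "Average Total Payments": [6000.0, 5000.0],
})

# Total = average per discharge * number of discharges.
toy["total_covered"] = toy["Average Covered Charges"] * toy["Total Discharges"]

# Share of the billed charge that was actually paid.
toy["payment_to_charge"] = toy["Average Total Payments"] / toy["Average Covered Charges"]

print(toy[["total_covered", "payment_to_charge"]])
```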
# sort by provider id so that rows for the same provider sit together
df = df.sort_values('Provider Id', ascending = True)
df.head(3)
# 163065 rows, 12 columns
df.shape
# examine column names
df.columns
# check data types
df.dtypes
# strip '$' and ',' from the last three 'object' columns and convert them to float
df[df.columns[9:]] = df[df.columns[9:]].replace(r'[\$,]', '', regex=True).astype(float)
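As a small check of the currency clean-up, the same pattern can be tried on a couple of strings first. The two values below are invented for illustration.

```python
import pandas as pd

# Made-up currency strings in the same style as the raw columns.
raw = pd.Series(["$32,963.07", "$5,777.24"])

# Remove every '$' and ',' with a regex, then parse the rest as float.
cleaned = raw.replace(r"[\$,]", "", regex=True).astype(float)
print(cleaned.tolist())
```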
# rename column names
df.columns = ['DRG Definition', 'Provider Id', 'Provider Name',
'Provider Street Address', 'Provider City', 'Provider State',
'Provider Zip Code', 'Hospital Referral Region Description',
'Total Discharges', 'Average Covered Charges',
'Average Total Payments', 'Average Medicare Payments']
No missing values in this dataset
# .info() shows there are no NA values in the dataset
df.info()
# alternatively, we can use df.isnull() to check missing values
df.isnull().sum()
# Here we can see the mean and standard deviation for average covered charges, average total payments, and average Medicare payments
# the mean of average covered charges is higher than that of average total payments, which is higher than that of average Medicare payments
df.describe()
# graph 1: distribution of total discharges
import matplotlib.pyplot as plt
plt.hist(df['Total Discharges'], 30, range=[0, 500], align='mid')
plt.xlabel('total discharges', fontsize=14)
plt.ylabel('frequency', fontsize=14)
plt.title('total discharges distribution', fontsize=14)
# graph 2: distribution of Average Covered Charges
plt.hist(df['Average Covered Charges'], 30, range=[0, 400000], align='mid')
plt.xlabel('Average Covered Charges', fontsize=14)
plt.ylabel('frequency', fontsize=14)
plt.title('Average Covered Charges', fontsize=14)
From the graph we can see that average total payments tend to be roughly one tenth of average covered charges.
# graph 3: distribution of Average Total Payments
plt.hist(df['Average Total Payments'], 30, range=[0, 40000], align='mid')
plt.xlabel('Average Total Payments', fontsize=14)
plt.ylabel('frequency', fontsize=14)
plt.title('Average Total Payments', fontsize=14)
we can see from the graph that the distribution of average Medicare payments is very similar to average total payments
# graph 4: distribution of Average Medicare Payments
plt.hist(df['Average Medicare Payments'], 30, range=[0, 40000], align='mid')
plt.xlabel('Average Medicare Payments', fontsize=14)
plt.ylabel('frequency', fontsize=14)
plt.title('Average Medicare Payments', fontsize=14)
Average covered charges by state
# graph 5: Average Covered Charges by State
import seaborn as sns
import matplotlib.pyplot as plt
a4_dims = (15.7, 8.27) # default figure size
fig, ax = plt.subplots(figsize=a4_dims)
sns.set(style="whitegrid")
sns.boxenplot(x="Provider State", y="Average Covered Charges",
color="b", scale="linear", data=df,ax=ax)
ax.set_title('Average Covered Charges by State')
Average total payment received by hospitals by state
# graph 6: Average total payment received by hospitals by state
a4_dims = (15.7, 8.27) # default figure size
fig, ax = plt.subplots(figsize=a4_dims)
sns.set(style="whitegrid")
sns.boxenplot(x="Provider State",
y="Average Total Payments",
color="b", scale="linear", data=df,ax=ax)
ax.set_title('Average Total Payment Hospitals received by State')
Average Medicare payment received by hospitals by state
# graph 7: Average Medicare payment received by hospitals by state
a4_dims = (15.7, 8.27) # default figure size
fig, ax = plt.subplots(figsize=a4_dims)
sns.set(style="whitegrid")
sns.boxenplot(x="Provider State",
y="Average Medicare Payments",
color="b", scale="linear", data=df, ax=ax)
ax.set_title('Average Medicare Payments Hospitals received by State')
We use an aggregation strategy to generate features. We use the median as the baseline, given that the dataset has extreme values.
--- discharge by provider ----
--- ratio by clinical condition ----
--- ratio of discharge by city ----
Why the feature can identify fraud: for a selected city, we can see which hospital has an extreme number of total discharges, and then examine those outliers further.
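One way to make "extreme" concrete is to compare each provider's total discharges with the median across providers in the same city. This is only a sketch on toy data: the provider names and the `city_median` and `ratio_to_city_median` columns are assumptions for illustration and do not appear in the notebook.

```python
import pandas as pd

# Toy data: three providers in one city, one row per (provider, DRG).
toy = pd.DataFrame({
    "Provider Name": ["A", "A", "B", "C", "C"],
    "Provider City": ["X"] * 5,
    "Total Discharges": [100, 50, 30, 20, 10],
})

# Sum discharges over all DRG rows of each provider.
per_provider = (
    toy.groupby(["Provider City", "Provider Name"], as_index=False)["Total Discharges"]
    .sum()
    .rename(columns={"Total Discharges": "discharge_by_provider"})
)

# Median across providers within the same city, broadcast back to each row.
per_provider["city_median"] = (
    per_provider.groupby("Provider City")["discharge_by_provider"].transform("median")
)
per_provider["ratio_to_city_median"] = (
    per_provider["discharge_by_provider"] / per_provider["city_median"]
)

print(per_provider.sort_values("ratio_to_city_median", ascending=False))
```

Sorting by the ratio puts the providers that deserve a closer look at the top of the table.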
# create variable: discharge_by_provider
discharge_by_provider = df.groupby(['Provider Name', 'Provider Id', 'Provider City'])['Total Discharges'].sum()
discharge_by_provider = pd.DataFrame(discharge_by_provider)
discharge_by_provider.columns = ['discharge_by_provider']
df2 = pd.merge(df, discharge_by_provider, how='left', on=['Provider Name', 'Provider Id', 'Provider City'])
df2.head(2)
# bring index to columns
discharge_by_provider = discharge_by_provider.reset_index()
# select city: NEW YORK
selected = discharge_by_provider.loc[discharge_by_provider['Provider City'] == 'NEW YORK']
# here we can see the total number of discharge by provider in New York
selected.head(2)
There are 11 providers in New York. The graph below shows the number of discharges for each provider in NYC.
From the graph we can see that New York Presbyterian Hospital has the most discharges.
# data visualization
# select city = 'New York'
# x: Provider Name
# y: discharge_by_provider
import plotly
import plotly.express as px
fig = px.scatter(selected, x="Provider Name", y="discharge_by_provider", color = 'discharge_by_provider', size = 'discharge_by_provider',
title = "Total number of discharge for providers in New York",
size_max=60,width=800, height=500)
fig.show()
#sns.set(style="whitegrid")
#sns.boxplot(x="Provider Name", y="discharge_by_provider",
# data=selected_df2)
# examine which city is included in the dataset
#for i in df2['Provider City'].unique():
# print(i)
Why the feature can identify fraud: for a selected city, we can examine the ratio of a provider's average covered charges for one clinical condition to the median covered charges for that clinical condition in the state.
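The ratio feature can be sketched on toy data with `groupby(...).transform`, which returns one value per original row and so avoids the separate merge step. The example also compares a median baseline with a mean baseline when one charge is extreme. All numbers are made up, and `group_median`, `group_mean`, `ratio_to_median`, and `ratio_to_mean` are illustrative names rather than columns used elsewhere in the notebook.

```python
import pandas as pd

# Five providers billing the same DRG in the same state; the last one is extreme.
toy = pd.DataFrame({
    "DRG Definition": ["039"] * 5,
    "Provider State": ["NY"] * 5,
    "Average Covered Charges": [10000.0, 11000.0, 12000.0, 13000.0, 100000.0],
})

keys = ["DRG Definition", "Provider State"]
grouped = toy.groupby(keys)["Average Covered Charges"]

# transform() broadcasts the group statistic back to every row of the group.
toy["group_median"] = grouped.transform("median")
toy["group_mean"] = grouped.transform("mean")

toy["ratio_to_median"] = toy["Average Covered Charges"] / toy["group_median"]
toy["ratio_to_mean"] = toy["Average Covered Charges"] / toy["group_mean"]

print(toy[["Average Covered Charges", "ratio_to_median", "ratio_to_mean"]])
```

The extreme row pulls the mean upward, which shrinks its own ratio; the median barely moves, so the extreme row stands out more clearly.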
# create variable: median_coverage_by_DRG_by_state
median_coverage_by_DRG_by_state = df2.groupby(['DRG Definition', 'Provider State'])['Average Covered Charges'].median()
# convert the grouped Series to a DataFrame and name the column
median_coverage_by_DRG_by_state = pd.DataFrame(median_coverage_by_DRG_by_state)
median_coverage_by_DRG_by_state.columns = ['median_coverage_by_DRG_by_state']
# add the state-level median for each DRG to the dataset
df2 = pd.merge(df2, median_coverage_by_DRG_by_state, how='left', on=['DRG Definition', 'Provider State'])
df2.head(5)
# Feature 2: ratio of a provider's average covered charges to the state median for the same DRG
df2['ratio_coverage_by_state'] = df2['Average Covered Charges'] / df2.median_coverage_by_DRG_by_state
df2.head(2)
# select DRG Definition = '039 - EXTRACRANIAL PROCEDURES W/O CC/MCC', city = 'New York'
selected1 = df2.loc[df2['Provider City'] == 'NEW YORK']
selected2 = selected1.loc[selected1['DRG Definition'] == '039 - EXTRACRANIAL PROCEDURES W/O CC/MCC']
There are 5 providers in New York that bill for '039 - EXTRACRANIAL PROCEDURES W/O CC/MCC'. From the graph below we can see that NYU Hospitals Center's average covered charges for this clinical condition are 1.8 times the New York State median.
fig = px.scatter(selected2, x="Provider Name", y="ratio_coverage_by_state", color = 'ratio_coverage_by_state', size = 'ratio_coverage_by_state',
title = "Ratio of coverage for selected clinical condition to median in New York",
size_max=60,width=800, height=500)
fig.show()
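Why the feature can identify fraud: for a selected city, we can compare the average total payment a provider receives for one clinical condition with the median payment for that condition in the state.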
# create variable: median_payment_by_DRG_by_state
median_payment_by_DRG_by_state = df2.groupby(['DRG Definition', 'Provider State'])['Average Total Payments'].median()
# convert the grouped Series to a DataFrame and name the column
median_payment_by_DRG_by_state = pd.DataFrame(median_payment_by_DRG_by_state)
median_payment_by_DRG_by_state.columns = ['median_payment_by_DRG_by_state']
# add the state-level median payment for each DRG to the dataset
df2 = pd.merge(df2, median_payment_by_DRG_by_state, how='left', on=['DRG Definition', 'Provider State'])
# Feature 3: ratio of a provider's average total payments to the state median for the same DRG
df2['ratio_payment_by_state'] = df2['Average Total Payments'] / df2.median_payment_by_DRG_by_state
df2.head(2)
# select DRG Definition = '039 - EXTRACRANIAL PROCEDURES W/O CC/MCC', city = 'New York'
selected1 = df2.loc[df2['Provider City'] == 'NEW YORK']
selected2 = selected1.loc[selected1['DRG Definition'] == '039 - EXTRACRANIAL PROCEDURES W/O CC/MCC']
There are 5 providers in New York that bill for '039 - EXTRACRANIAL PROCEDURES W/O CC/MCC'.
From the graph below we can see that New York Presbyterian Hospital received approximately 1.5 times the New York State median payment for the selected clinical condition.
fig = px.scatter(selected2, x="Provider Name", y="ratio_payment_by_state", color = 'ratio_payment_by_state', size = 'ratio_payment_by_state',
title = "Ratio of actual payment to hospital for selected clinical condition to median in New York",
size_max=60,width=800, height=500)
fig.show()
Why the feature can identify fraud: we can see which provider receives the highest Medicare payment relative to the state median, and examine the possibility of Medicare fraud.
# create variable: median_medicare_by_DRG_by_state
median_medicare_by_DRG_by_state = df2.groupby(['DRG Definition', 'Provider State'])['Average Medicare Payments'].median()
# convert the grouped Series to a DataFrame and name the column
median_medicare_by_DRG_by_state = pd.DataFrame(median_medicare_by_DRG_by_state)
median_medicare_by_DRG_by_state.columns = ['median_medicare_by_DRG_by_state']
# add the state-level median Medicare payment for each DRG to the dataset
df2 = pd.merge(df2, median_medicare_by_DRG_by_state, how='left', on=['DRG Definition', 'Provider State'])
# Feature 4: ratio of a provider's average Medicare payments to the state median for the same DRG
df2['ratio_medicare_by_state'] = df2['Average Medicare Payments'] / df2.median_medicare_by_DRG_by_state
df2.head(2)
# select DRG Definition = '039 - EXTRACRANIAL PROCEDURES W/O CC/MCC', city = 'New York'
selected1 = df2.loc[df2['Provider City'] == 'NEW YORK']
selected2 = selected1.loc[selected1['DRG Definition'] == '039 - EXTRACRANIAL PROCEDURES W/O CC/MCC']
There are 5 providers in New York that bill for '039 - EXTRACRANIAL PROCEDURES W/O CC/MCC'.
From the graph below we can see that BETH ISRAEL MEDICAL CENTER received approximately 1.5 times the New York State median Medicare payment for the selected clinical condition, the highest among these hospitals.
fig = px.scatter(selected2, x="Provider Name", y="ratio_medicare_by_state", color = 'ratio_medicare_by_state', size = 'ratio_medicare_by_state',
title = "Ratio of medicare payment to hospital for selected clinical condition to median in New York",
size_max=60,width=800, height=500)
fig.show()
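Why the feature can identify fraud: for a selected city, we can see which provider reports far more discharges for a clinical condition than its median level.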
# create variable: median_discharge_by_provider_by_city
median_discharge_by_provider_by_city = df2.groupby(['Provider Id', 'Provider City'])['Total Discharges'].median()
# convert the grouped Series to a DataFrame and name the column
median_discharge_by_provider_by_city = pd.DataFrame(median_discharge_by_provider_by_city)
median_discharge_by_provider_by_city.columns = ['median_discharge_by_provider_by_city']
# add the median discharges per provider and city to the dataset
df2 = pd.merge(df2, median_discharge_by_provider_by_city, how='left', on=['Provider Id', 'Provider City'])
# Feature 5: ratio of total discharges for a DRG to the median discharges of the same provider and city
df2['ratio_average_discharge_by_provider_by_city'] = df2['Total Discharges'] / df2.median_discharge_by_provider_by_city
df2.head(2)
# select city = 'New York'
selected1 = df2.loc[df2['Provider City'] == 'NEW YORK']
selected1.head(2)
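From the graph we can see that NYU Hospitals Center has a clinical condition whose total discharges are about 10 times the median. Note that the median is grouped by Provider Id and Provider City, so the baseline is each provider's own median across its clinical conditions rather than a single median for all of NYC.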
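The choice of groupby keys decides what the discharge ratio is measured against. The toy example below contrasts the keys used in this notebook (`Provider Id` and `Provider City`) with a city-only grouping. It assumes each Provider Id appears in only one city; the city-only variant and the column names `median_per_provider`, `median_per_city`, `ratio_vs_provider`, and `ratio_vs_city` are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Two providers in one city, three DRG rows each (made-up numbers).
toy = pd.DataFrame({
    "Provider Id": [1, 1, 1, 2, 2, 2],
    "Provider City": ["X"] * 6,
    "Total Discharges": [10, 20, 200, 100, 110, 120],
})

# Same keys as the notebook: one median per provider.
toy["median_per_provider"] = (
    toy.groupby(["Provider Id", "Provider City"])["Total Discharges"].transform("median")
)

# Alternative (assumption): one median per city, shared by all its providers.
toy["median_per_city"] = (
    toy.groupby("Provider City")["Total Discharges"].transform("median")
)

toy["ratio_vs_provider"] = toy["Total Discharges"] / toy["median_per_provider"]
toy["ratio_vs_city"] = toy["Total Discharges"] / toy["median_per_city"]

print(toy)
```

In this toy case the row with 200 discharges is 10 times its own provider's median but less than 2 times the city median, so the two baselines can tell quite different stories.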
fig = px.scatter(selected1, x="Provider Name", y="ratio_average_discharge_by_provider_by_city", color = 'ratio_average_discharge_by_provider_by_city', size = 'ratio_average_discharge_by_provider_by_city',
title = "Ratio of total number of discharge by provider to the median discharges in New York",
size_max=60,width=900, height=600)
fig.show()
# Standardization
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
df3 = df2.loc[:, ['ratio_coverage_by_state', 'ratio_payment_by_state',
'ratio_medicare_by_state', 'ratio_average_discharge_by_provider_by_city']]
# Standardize the data to have a mean of 0 and a variance of 1
X_std = StandardScaler().fit_transform(df3)
The graph below shows that the first component explains 51% of the variance in our data, the first 2 components explain 76%, and the first 3 components explain 96%.
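The cumulative figures come from summing `explained_variance_ratio_`. The sketch below shows that calculation on synthetic data built from two correlated pairs of features, then picks the smallest number of components that reaches a threshold. The data, the 0.95 cut-off, and the name `n_keep` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for four features: two nearly duplicated pairs.
base = rng.normal(size=(500, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=500),
    base[:, 1],
    base[:, 1] + 0.1 * rng.normal(size=500),
])

# Standardize, then fit PCA with as many components as features.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=4).fit(X_std)

# Running total of the variance share explained by the leading components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the threshold.
threshold = 0.95  # assumed cut-off
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative.round(3), n_keep)
```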
# Create a PCA instance: pca
pca = PCA(n_components=4)
principalComponents = pca.fit_transform(X_std)
# Plot the explained variances
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_ratio_)
plt.xlabel('PCA features')
plt.ylabel('variance %')
plt.xticks(features)
# Save components to a DataFrame
PCA_components = pd.DataFrame(principalComponents)
#plt.figure(figsize = (10,8))
plt.plot(pca.explained_variance_ratio_.cumsum(), marker = 'o', linestyle = '--')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
# plot first 2 components in 2 dimensional space
plt.scatter(PCA_components[0], PCA_components[1], alpha=.1)
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
# view PCA components
PCA_components
We will run k-means clustering on the top 3 PCA components, given that the first 3 components explain 96% of the variance in our data.
From the graph, we can see elbow points at k = 2 and k = 4; however, we will also examine other numbers of clusters.
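Reading an elbow off a plot is subjective, so it can help to print the relative drop in inertia from one k to the next. The sketch below does this on three synthetic, well-separated blobs, where the largest drop is expected when moving to k = 3. The data and the `drops` list are assumptions for illustration; `n_init=10` and `random_state=1` are set only to make the toy run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs in two dimensions.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

ks = list(range(1, 7))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias.append(model.inertia_)

# Relative drop in inertia when going from k-1 to k clusters.
drops = [
    (inertias[i - 1] - inertias[i]) / inertias[i - 1]
    for i in range(1, len(inertias))
]
for k, d in zip(ks[1:], drops):
    print(f"k={k}: inertia falls by {d:.0%}")
```

After the elbow, each extra cluster removes only a small share of the remaining inertia.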
# determine the best number of clusters
ks = range(1, 10)
inertias = []
for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)
    # Fit model to samples
    model.fit(PCA_components.iloc[:,:3])
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
plt.plot(ks, inertias, '-o', color='black')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
# write a function to get summary statistics for each k
def identify_outlier_by_kmeans(k, PCA_components, num_component):
    from sklearn.cluster import KMeans
    # implement k means
    kmean = KMeans(n_clusters = k, random_state=1)
    # fit data: use only the first num_component principal components
    kmean.fit(PCA_components.iloc[:,:num_component])
    df_kmean = df2.copy()
    # add the cluster label as a new column on a copy of the dataset
    df_kmean['cluster']=kmean.labels_
    # count by cluster
    table1 = pd.DataFrame(df_kmean['cluster'].value_counts())
    table1.columns = ['count']
    # summary statistics: mean of each variable by cluster
    table2 = df_kmean.loc[:,['Total Discharges', 'Average Covered Charges', 'Average Total Payments', 'Average Medicare Payments',
                             'ratio_coverage_by_state','ratio_payment_by_state','ratio_medicare_by_state','ratio_average_discharge_by_provider_by_city',
                             'cluster']].groupby(['cluster']).mean()
    table3 = pd.concat([table1, table2], axis = 1)
    # return distribution by cluster and summary statistics
    return table3
identify_outlier_by_kmeans(4, PCA_components, num_component=3)
identify_outlier_by_kmeans(5, PCA_components, num_component=3)
identify_outlier_by_kmeans(6, PCA_components, num_component=3)
output7 = identify_outlier_by_kmeans(7, PCA_components, num_component=3)
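# x: cluster, y: the four ratio columns
# number the summary rows 1..7 in table order for plotting (positions, not the original k-means labels 0..6)
output7['cluster'] = [1,2,3,4,5,6,7]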
output7
# multiple line plot
plt.plot( 'cluster', 'ratio_coverage_by_state', data=output7, marker='', markerfacecolor='blue', markersize=12, color='skyblue', linewidth=4, label="ratio_coverage_by_state")
plt.plot( 'cluster', 'ratio_payment_by_state', data=output7, marker='', color='orange', linewidth=2,label="ratio_payment_by_state")
plt.plot( 'cluster', 'ratio_medicare_by_state', data=output7, marker='', color='olive', linewidth=2, linestyle='dashed', label="ratio_medicare_by_state")
plt.plot( 'cluster', 'ratio_average_discharge_by_provider_by_city', data=output7, marker='', color='red', linewidth=2, linestyle='dashed', label="ratio_average_discharge_by_provider_by_city")
plt.legend()
plt.xlabel('cluster')
plt.title('Plot ratios by cluster')
# if we do not plot the variable about discharges
plt.plot( 'cluster', 'ratio_coverage_by_state', data=output7, marker='', markerfacecolor='blue', markersize=12, color='skyblue', linewidth=4, label="ratio_coverage_by_state")
plt.plot( 'cluster', 'ratio_payment_by_state', data=output7, marker='', color='orange', linewidth=2,label="ratio_payment_by_state")
plt.plot( 'cluster', 'ratio_medicare_by_state', data=output7, marker='', color='olive', linewidth=2, linestyle='dashed', label="ratio_medicare_by_state")
plt.legend()
plt.xlabel('cluster')
plt.title('Plot ratios by cluster')
We found that ratio_payment_by_state and ratio_medicare_by_state are very similar, so we keep only one of them, ratio_payment_by_state.
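The similarity seen on the cluster-mean plot can also be checked at row level with a correlation matrix. The sketch below uses synthetic ratios in which the Medicare ratio tracks the payment ratio with small noise; the data is made up, and only the column names are borrowed from the notebook. On the real features the same check would be a `.corr()` call on the ratio columns of `df2`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic ratios centred near 1.
payment = rng.lognormal(mean=0.0, sigma=0.3, size=1000)
toy = pd.DataFrame({
    "ratio_payment_by_state": payment,
    # Assumed relationship: Medicare ratio follows the payment ratio closely.
    "ratio_medicare_by_state": payment * rng.normal(1.0, 0.05, size=1000),
    # Independent of the other two in this toy set-up.
    "ratio_coverage_by_state": rng.lognormal(mean=0.0, sigma=0.3, size=1000),
})

# Pearson correlation (the pandas default) between every pair of columns.
corr = toy.corr()
print(corr.round(2))
```

A correlation close to 1 between two features suggests that keeping both adds little information for clustering.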
plt.plot( 'cluster', 'ratio_coverage_by_state', data=output7, marker='', markerfacecolor='blue', markersize=12, color='skyblue', linewidth=4, label="ratio_coverage_by_state")
plt.plot( 'cluster', 'ratio_payment_by_state', data=output7, marker='', color='orange', linewidth=2,label="ratio_payment_by_state")
plt.legend()
plt.xlabel('cluster')
plt.title('Plot ratios by cluster')
# percentage of cluster 2
8310/163065
output8 = identify_outlier_by_kmeans(8, PCA_components, num_component=3)
output8['cluster'] = [1,2,3,4,5,6,7,8]
output8
# percentage of cluster 2
# 5701/163065
# group by cluster
# y: last 4 columns
# multiple line plot
plt.plot( 'cluster', 'ratio_coverage_by_state', data=output8, marker='', markerfacecolor='blue', markersize=12, color='skyblue', linewidth=4, label="ratio_coverage_by_state")
plt.plot( 'cluster', 'ratio_payment_by_state', data=output8, marker='', color='orange', linewidth=2,label="ratio_payment_by_state")
plt.plot( 'cluster', 'ratio_medicare_by_state', data=output8, marker='', color='olive', linewidth=2, linestyle='dashed', label="ratio_medicare_by_state")
plt.plot( 'cluster', 'ratio_average_discharge_by_provider_by_city', data=output8, marker='', color='red', linewidth=2, linestyle='dashed', label="ratio_average_discharge_by_provider_by_city")
plt.xlabel('cluster')
plt.title('Plot ratios by cluster')
plt.legend(loc='upper left')
plt.show()
We found that ratio_payment_by_state and ratio_medicare_by_state are very similar, so we keep only one of them, ratio_payment_by_state.
# if we do not plot the variable about discharges
plt.plot( 'cluster', 'ratio_coverage_by_state', data=output8, marker='', markerfacecolor='blue', markersize=12, color='skyblue', linewidth=4, label="ratio_coverage_by_state")
plt.plot( 'cluster', 'ratio_payment_by_state', data=output8, marker='', color='orange', linewidth=2,label="ratio_payment_by_state")
plt.xlabel('cluster')
plt.title('Plot ratios by cluster')
plt.legend(loc='upper left')
plt.show()
output9 = identify_outlier_by_kmeans(9, PCA_components, num_component=3)
output9['cluster'] = [1,2,3,4,5,6,7,8,9]
output9
# percentage of cluster 2
5620/163065
# group by cluster
# y: last 4 columns
output9['cluster'] = [1,2,3,4,5,6,7,8, 9]
# multiple line plot
plt.plot( 'cluster', 'ratio_coverage_by_state', data=output9, marker='', markerfacecolor='blue', markersize=12, color='skyblue', linewidth=4, label="ratio_coverage_by_state")
plt.plot( 'cluster', 'ratio_payment_by_state', data=output9, marker='', color='orange', linewidth=2,label="ratio_payment_by_state")
plt.plot( 'cluster', 'ratio_medicare_by_state', data=output9, marker='', color='olive', linewidth=2, linestyle='dashed', label="ratio_medicare_by_state")
plt.plot( 'cluster', 'ratio_average_discharge_by_provider_by_city', data=output9, marker='', color='red', linewidth=2, linestyle='dashed', label="ratio_average_discharge_by_provider_by_city")
plt.xlabel('cluster')
plt.title('Plot ratios by cluster')
plt.legend(loc='upper left')
plt.show()
# if we do not plot the variable about discharges
# multiple line plot
plt.plot( 'cluster', 'ratio_coverage_by_state', data=output9, marker='', markerfacecolor='blue', markersize=12, color='skyblue', linewidth=4, label="ratio_coverage_by_state")
plt.plot( 'cluster', 'ratio_payment_by_state', data=output9, marker='', color='orange', linewidth=2,label="ratio_payment_by_state")
plt.plot( 'cluster', 'ratio_medicare_by_state', data=output9, marker='', color='olive', linewidth=2, linestyle='dashed', label="ratio_medicare_by_state")
plt.xlabel('cluster')
plt.title('Plot ratios by cluster')
plt.legend(loc='upper left')
plt.show()
# ratio_payment_by_state and ratio_medicare_by_state are very similar, so we keep only one of them, ratio_payment_by_state
plt.plot( 'cluster', 'ratio_coverage_by_state', data=output9, marker='', markerfacecolor='blue', markersize=12, color='skyblue', linewidth=4, label="ratio_coverage_by_state")
plt.plot( 'cluster', 'ratio_payment_by_state', data=output9, marker='', color='orange', linewidth=2,label="ratio_payment_by_state")
plt.xlabel('cluster')
plt.title('Plot ratios by cluster')
plt.legend(loc='upper left')
plt.show()
output10 = identify_outlier_by_kmeans(10, PCA_components, num_component=3)
output10['cluster'] = [1,2,3,4,5,6,7,8,9,10]
output10
# percentage of clusters
7051/163065
# group by cluster
# y: last 4 columns
# multiple line plot
plt.plot( 'cluster', 'ratio_coverage_by_state', data=output10, marker='', markerfacecolor='blue', markersize=12, color='skyblue', linewidth=4, label="ratio_coverage_by_state")
plt.plot( 'cluster', 'ratio_payment_by_state', data=output10, marker='', color='orange', linewidth=2,label="ratio_payment_by_state")
plt.plot( 'cluster', 'ratio_medicare_by_state', data=output10, marker='', color='olive', linewidth=2, linestyle='dashed', label="ratio_medicare_by_state")
plt.plot( 'cluster', 'ratio_average_discharge_by_provider_by_city', data=output10, marker='', color='red', linewidth=2, linestyle='dashed', label="ratio_average_discharge_by_provider_by_city")
plt.xlabel('cluster')
plt.title('Plot ratios by cluster')
plt.legend(loc='upper left')
plt.show()
# if we do not plot the variable about discharges
# multiple line plot
plt.plot( 'cluster', 'ratio_coverage_by_state', data=output10, marker='', markerfacecolor='blue', markersize=12, color='skyblue', linewidth=4, label="ratio_coverage_by_state")
plt.plot( 'cluster', 'ratio_payment_by_state', data=output10, marker='', color='orange', linewidth=2,label="ratio_payment_by_state")
plt.plot( 'cluster', 'ratio_medicare_by_state', data=output10, marker='', color='olive', linewidth=2, linestyle='dashed', label="ratio_medicare_by_state")
plt.xlabel('cluster')
plt.title('Plot ratios by cluster')
plt.legend(loc='upper left')
plt.show()
plt.plot( 'cluster', 'ratio_coverage_by_state', data=output10, marker='', markerfacecolor='blue', markersize=12, color='skyblue', linewidth=4, label="ratio_coverage_by_state")
plt.plot( 'cluster', 'ratio_payment_by_state', data=output10, marker='', color='orange', linewidth=2,label="ratio_payment_by_state")
plt.xlabel('cluster')
plt.title('Plot ratios by cluster')
plt.legend(loc='upper left')
plt.show()
The clustering with k = 10 gives a cluster with a reasonable number of observations that is clearly differentiated from the other clusters.
Cluster 5 contains 6.8% of observations. In this cluster, health providers' actual total payments are 2 times the state median, and their covered charges are 2.1 times the state median.
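The summary function above returns only per-cluster statistics, so reviewing the individual records in a suspicious cluster requires keeping the labels on the rows. The sketch below shows one way to do that on synthetic data with two ratio features and k = 2. The data, the `share` column, and the rule of picking the cluster with the highest mean payment ratio are assumptions for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in: a large "typical" group near 1 and a small group near 2.
typical = rng.normal(loc=1.0, scale=0.1, size=(950, 2))
unusual = rng.normal(loc=2.0, scale=0.1, size=(50, 2))
toy = pd.DataFrame(
    np.vstack([typical, unusual]),
    columns=["ratio_coverage_by_state", "ratio_payment_by_state"],
)

# Fit k-means and keep the label of every row.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(toy)
toy["cluster"] = kmeans.labels_

# Mean ratios and share of rows per cluster, indexed by the actual k-means label.
summary = toy.groupby("cluster").mean()
summary["share"] = toy["cluster"].value_counts(normalize=True)

# Assumed rule: flag the cluster with the highest mean payment ratio.
flagged_label = summary["ratio_payment_by_state"].idxmax()
flagged_rows = toy[toy["cluster"] == flagged_label]

print(summary.round(3))
print(len(flagged_rows))
```

Indexing the summary by the real label, rather than renumbering the rows, keeps the link between a summary line and the records it describes.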